selection error
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Wang, Zhanliang, Xiao, Jiancong, Jin, Ruochen, Yang, Shu, Hou, Bojian, Shen, Li
Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem$_1$-ECE, the same-sample self-consistency score, and Sem$_2$-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem$_2$ achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.
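The sampling-based confidence described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' estimator: the abstract does not say how answers are grouped into semantic classes, so exact match after lowercasing stands in for that step, and every function name here is hypothetical. The calibration error shown is the usual binned ECE.

```python
from collections import Counter

def normalize(answer: str) -> str:
    # Assumption: lowercasing and whitespace collapsing stand in for semantic grouping.
    return " ".join(answer.lower().split())

def same_sample_confidence(samples):
    """Select the most frequent answer class and score it on the same samples."""
    counts = Counter(normalize(s) for s in samples)
    top, n = counts.most_common(1)[0]
    return top, n / len(samples)

def held_out_confidence(selection_samples, evaluation_samples):
    """Select the answer on one batch, then measure its frequency on a separate batch."""
    top, _ = same_sample_confidence(selection_samples)
    hits = sum(1 for s in evaluation_samples if normalize(s) == top)
    return top, hits / len(evaluation_samples)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / total * abs(acc - avg_conf)
    return ece
```

Scoring the selected answer on a separate batch avoids measuring its frequency on the same samples that selected it, which is the separation the abstract attributes to the held-out variant.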
Learning to rank quantum circuits for hardware-optimized performance enhancement
Hartnett, Gavin S., Barbosa, Aaron, Mundada, Pranav S., Hush, Michael, Biercuk, Michael J., Baum, Yuval
We introduce and experimentally test a machine-learning-based method for ranking logically equivalent quantum circuits based on expected performance estimates derived from a training procedure conducted on real hardware. We apply our method to the problem of layout selection, in which abstracted qubits are assigned to physical qubits on a given device. Circuit measurements performed on IBM hardware indicate that the maximum and median fidelities of logically equivalent layouts can differ by an order of magnitude. We introduce a circuit score used for ranking that is parameterized in terms of a physics-based, phenomenological error model whose parameters are fit by training a ranking-loss function over a measured dataset. The dataset consists of quantum circuits exhibiting a diversity of structures and executed on IBM hardware, allowing the model to incorporate the contextual nature of real device noise and errors without the need to perform an exponentially costly tomographic protocol. We perform model training and execution on the 16-qubit ibmq_guadalupe device and compare our method to two common approaches: random layout selection and a publicly available baseline called Mapomatic. Our model consistently outperforms both approaches, predicting layouts that exhibit lower noise and higher performance. In particular, we find that our best model leads to a $1.8\times$ reduction in selection error when compared to the baseline approach and a $3.2\times$ reduction when compared to random selection. Beyond delivering a new form of predictive quantum characterization, verification, and validation, our results reveal the specific way in which context-dependent and coherent gate errors appear to dominate the divergence from performance estimates extrapolated from simple proxy measures.
Gas emission reduction machine learning example
The objective of model selection is to find the network architecture with the best generalization properties. We want to improve the final selection error obtained before (0.263 NSE). The best selection error is achieved using a model with the most appropriate complexity to produce a good data fit. Order selection algorithms are responsible for finding the optimal number of perceptrons in the neural network. The following chart shows the results of the incremental order algorithm.
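As a rough illustration of incremental order selection, the sketch below tries network sizes in increasing order and keeps the one with the lowest selection error. It is not the tool's implementation: `selection_error` is a hypothetical callback that would train a network with the given number of perceptrons and return its error on the selection data, and the toy error curve is invented for the example.

```python
def incremental_order(selection_error, min_order=1, max_order=10):
    """Try orders from min_order to max_order and keep the lowest selection error."""
    best_order, best_error = None, float("inf")
    history = []
    for order in range(min_order, max_order + 1):
        # Hypothetical callback: train with `order` perceptrons, return held-out error.
        error = selection_error(order)
        history.append((order, error))
        if error < best_error:
            best_order, best_error = order, error
    return best_order, best_error, history

def toy_selection_error(order):
    # Invented curve: error falls as capacity grows, then rises once the model overfits.
    return 0.2 + 0.01 * (order - 4) ** 2

best_order, best_error, history = incremental_order(toy_selection_error)
```

The returned `history` holds the (order, error) pairs that a chart of the search would plot.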
Gas emission reduction machine learning example
The objective of model selection is to find the network architecture with the best generalization properties. That is, we want to improve the final selection error obtained before (0.263 NSE). The best selection error is achieved by using a model with the most appropriate complexity to produce an adequate fit of the data. Order selection algorithms are responsible for finding the optimal number of perceptrons in the neural network. The following chart shows the results of the incremental order algorithm.
Provably Correct Algorithms for Matrix Column Subset Selection with Selectively Sampled Data
We consider the problem of matrix column subset selection, which selects a subset of columns from an input matrix such that the input can be well approximated by the span of the selected columns. Column subset selection has been applied to numerous real-world data applications such as population genetics summarization, electronic circuit testing and recommendation systems. In many applications the complete data matrix is unavailable and one needs to select representative columns by inspecting only a small portion of the input matrix. In this paper we propose the first provably correct column subset selection algorithms for partially observed data matrices. Our proposed algorithms exhibit different merits and limitations in terms of statistical accuracy, computational efficiency, sample complexity and sampling schemes, which provides a useful exploration of the tradeoff between these desired properties for column subset selection. The proposed methods employ the idea of feedback-driven sampling and are inspired by several sampling schemes previously introduced for low-rank matrix approximation tasks (Drineas et al., 2008; Frieze et al., 2004; Deshpande and Vempala, 2006; Krishnamurthy and Singh, 2014). Our analysis shows that, under the assumption that the input data matrix has incoherent rows but possibly coherent columns, all algorithms provably converge to the best low-rank approximation of the original data as the number of selected columns increases. Furthermore, two of the proposed algorithms enjoy a relative error bound, which is preferred for column subset selection and matrix approximation purposes. We also demonstrate through both theoretical and empirical analysis the power of feedback-driven sampling compared to uniform random sampling on input matrices with highly correlated columns.
Genetic algorithms for feature selection in Data Analytics
Many common applications of predictive analytics, from customer segmentation to medical diagnosis, arise from complex relationships between features (also called variables or characteristics). Feature selection is the process of finding the most relevant variables for a predictive model. These techniques can be used to identify and remove unneeded, irrelevant and redundant features that do not contribute to, or that decrease, the accuracy of the predictive model. Mathematically, feature selection is formulated as a combinatorial optimization problem. Here the function to optimize is the generalization performance of the predictive model, represented by the error on a selection data set.
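To make the combinatorial formulation concrete, the sketch below searches every non-empty feature subset and keeps the one with the lowest selection error. It uses plain exhaustive search, not the genetic algorithm named in the title, and is only practical for a handful of features since the number of subsets grows as 2^n. The `selection_error` callback and the toy error function are hypothetical.

```python
from itertools import product

def best_feature_subset(n_features, selection_error):
    """Exhaustively evaluate every non-empty feature mask and return the best one."""
    best_mask, best_error = None, float("inf")
    for mask in product([False, True], repeat=n_features):
        if not any(mask):
            continue  # a model needs at least one input
        # Hypothetical callback: train on the masked features, return selection-set error.
        error = selection_error(mask)
        if error < best_error:
            best_mask, best_error = mask, error
    return best_mask, best_error

def toy_error(mask):
    # Invented objective: features 0 and 2 are relevant; each missing relevant
    # feature costs 1.0 and each irrelevant feature included costs 0.1.
    relevant = {0, 2}
    chosen = {i for i, used in enumerate(mask) if used}
    return 1.0 * len(relevant - chosen) + 0.1 * len(chosen - relevant)

best_mask, best_error = best_feature_subset(4, toy_error)
```

A heuristic search such as a genetic algorithm would explore the same space of binary masks while evaluating only a fraction of them.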
Uncorrelated Group LASSO
Kong, Deguang (Samsung Research America) | Liu, Ji (University of Rochester) | Liu, Bo (Philips Research North America) | Bao, Xuan (Google)
The $\ell_{2,1}$-norm is an effective regularization to enforce a simple group sparsity for feature learning. To capture some subtle structures among feature groups, we propose a new regularization called the exclusive group $\ell_{2,1}$-norm. It enforces sparsity at the intra-group level by using the $\ell_{2,1}$-norm, while encouraging the selected features to distribute across different groups by using the $\ell_2$ norm at the inter-group level. The proposed exclusive group $\ell_{2,1}$-norm is capable of eliminating feature correlations in the context of feature selection, if highly correlated features are collected in the same groups. To solve the generic exclusive group $\ell_{2,1}$-norm regularized problems, we propose an efficient iterative re-weighting algorithm and provide a rigorous convergence analysis. Experimental results on real-world datasets demonstrate the effectiveness of the proposed new regularization and algorithm.
Combined l_1 and greedy l_0 penalized least squares for linear model selection
Pokarowski, Piotr, Mielniczuk, Jan
We introduce a computationally effective algorithm for linear model selection consisting of three steps: screening--ordering--selection (SOS). Screening of predictors is based on the thresholded Lasso, that is, l_1 penalized least squares. The screened predictors are then fitted using least squares (LS) and ordered with respect to their t statistics. Finally, a model is selected using the greedy generalized information criterion (GIC), that is, l_0 penalized LS in a nested family induced by the ordering. We give non-asymptotic upper bounds on the error probability of each step of the SOS algorithm in terms of both penalties. Then we obtain selection consistency for different (n, p) scenarios under conditions which are needed for screening consistency of the Lasso. For the traditional setting (n > p) we give Sanov-type bounds on the error probabilities of the ordering--selection algorithm. A surprising consequence is that the selection error of greedy GIC is asymptotically not larger than that of exhaustive GIC. We also obtain new bounds on prediction and estimation errors for the Lasso which are proved in parallel for the algorithm used in practice and its formal version.
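The three SOS steps can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the Lasso is solved by a basic coordinate descent written for the example, the criterion is taken to be RSS plus `penalty` times model size (the abstract does not give the exact GIC form), and `lam`, `threshold` and `penalty` are hypothetical tuning parameters. It assumes NumPy is available, a float-valued `y`, and more observations than screened predictors.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]  # take feature j out of the current fit
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]  # put the updated feature j back
    return b

def sos(X, y, lam, threshold, penalty):
    n, _ = X.shape
    # Screening: keep predictors whose Lasso coefficient exceeds the threshold.
    screened = np.flatnonzero(np.abs(lasso_cd(X, y, lam)) > threshold)
    if screened.size == 0:
        return []
    # Ordering: refit by least squares and sort by decreasing |t statistic|.
    Xs = X[:, screened]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    sigma2 = resid @ resid / max(n - screened.size, 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
    order = screened[np.argsort(-np.abs(beta / se))]
    # Selection: minimize RSS + penalty * size over the nested family (assumed criterion).
    best, best_score = [], float(y @ y)  # start from the empty model
    for k in range(1, len(order) + 1):
        Xk = X[:, order[:k]]
        bk, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        score = float(((y - Xk @ bk) ** 2).sum()) + penalty * k
        if score < best_score:
            best, best_score = sorted(order[:k].tolist()), score
    return best

# Synthetic check: only columns 1 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 1] - 2 * X[:, 4] + 0.1 * rng.standard_normal(200)
selected = sos(X, y, lam=0.1, threshold=0.5, penalty=5.0)
```

Because the ordering fixes a nested family, the selection step fits at most as many models as there are screened predictors rather than every subset.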
Nonparametric sparsity and regularization
Rosasco, Lorenzo, Villa, Silvia, Mosci, Sofia, Santoro, Matteo, Verri, Alessandro
It is now common to see practical applications, for example in bioinformatics and computer vision, where the dimensionality of the data is on the order of hundreds, thousands and even tens of thousands. It is known that learning in such a high-dimensional regime is feasible only if the quantity to be estimated satisfies some regularity assumptions [24]. In particular, the idea behind so-called sparsity is that the quantity of interest depends only on a few relevant variables (dimensions). In turn, this latter assumption is often at the basis of the construction of interpretable data models, since the relevant dimensions allow for a compact, hence interpretable, representation. An instance of the above situation is the problem of learning from samples a multivariate function which depends only on a (possibly small) subset of relevant variables. Detecting such variables is the problem of variable selection. Largely motivated by recent advances in compressed sensing [15, 25], the above problem has been extensively studied under the assumption that the function of interest (target function) depends linearly on the relevant variables.
On Efficient Heuristic Ranking of Hypotheses
Chien, Steve A., Stechert, Andre, Mutz, Darren
Content Areas: Applications (Stochastic Optimization), Model Selection Algorithms. Abstract: This paper considers the problem of learning the ranking of a set of alternatives based upon incomplete information (e.g., a limited number of observations). We describe two algorithms for hypothesis ranking and their application for probably approximately correct (PAC) and expected loss (EL) learning criteria. Empirical results are provided to demonstrate the effectiveness of these ranking procedures on both synthetic datasets and real-world data from a spacecraft design optimization problem. 1 INTRODUCTION In many learning applications, the cost of information can be quite high, imposing a requirement that the learning algorithms glean as much usable information as possible with a minimum of data. For example, in speedup learning, the expense of processing each training example can be significant [Tadepalli92]. This paper provides a statistical decision-theoretic framework for the ranking of parametric distributions.